In [117]:
!pip install tpot
!pip install scikit-learn-intelex
Requirement already satisfied: tpot in /usr/local/lib/python3.7/dist-packages (0.11.7)
Requirement already satisfied: scikit-learn>=0.22.0 in /usr/local/lib/python3.7/dist-packages (from tpot) (1.0.2)
Requirement already satisfied: xgboost>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from tpot) (1.5.2)
Requirement already satisfied: scikit-learn-intelex in /usr/local/lib/python3.7/dist-packages (2021.5.3)
In [118]:
import pandas as pd
import numpy as np
import math
import sys
import sklearn
import tpot
import plotly
import xgboost
import plotly.express as px
import plotly.graph_objects as go
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.tree import plot_tree
from sklearn.manifold import TSNE
from sklearn.metrics import make_scorer
from sklearn import preprocessing
from sklearn import tree
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
from sklearnex import patch_sklearn

patch_sklearn()

data = pd.read_csv("./heart.csv")

data_np = data.to_numpy()

data['target'].value_counts()
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
Out[118]:
1    165
0    138
Name: target, dtype: int64
In [119]:
filtered_data = data[['age', 'sex', 'cp', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']]

filtered_data
Out[119]:
age sex cp thalach exang oldpeak slope ca thal target
0 63 1 3 150 0 2.3 0 0 1 1
1 37 1 2 187 0 3.5 0 0 2 1
2 41 0 1 172 0 1.4 2 0 2 1
3 56 1 1 178 0 0.8 2 0 2 1
4 57 0 0 163 1 0.6 2 0 2 1
... ... ... ... ... ... ... ... ... ... ...
298 57 0 0 123 1 0.2 1 0 3 0
299 45 1 3 132 0 1.2 1 0 3 0
300 68 1 0 141 0 3.4 1 2 3 0
301 57 1 0 115 1 1.2 1 1 3 0
302 57 0 1 174 0 0.0 1 1 2 0

303 rows × 10 columns

In [120]:
train, test = train_test_split(filtered_data, test_size=0.2)
train_X = train.drop('target', axis=1)
train_y = train['target']
test_X = test.drop('target', axis=1)
test_y = test['target']

# Note: the scaler is fit on the full dataset (mild test-set leakage), and the
# second train_test_split below draws a different random split than train/test above.
scaler = preprocessing.StandardScaler().fit(filtered_data.drop('target', axis=1))

normalized_train, normalized_test = train_test_split(filtered_data, test_size=0.2)
normalized_train_X = normalized_train.drop('target', axis=1)
normalized_train_X = scaler.transform(normalized_train_X)
normalized_train_y = normalized_train['target']
normalized_test_X = normalized_test.drop('target', axis=1)
normalized_test_X = scaler.transform(normalized_test_X)
normalized_test_y = normalized_test['target']

normalized_train_X, normalized_train_y
Out[120]:
(array([[ 0.62133012, -1.46841752,  1.97312292, ...,  0.97635214,
         -0.71442887, -0.51292188],
        [ 0.5110413 ,  0.68100522, -0.93851463, ..., -2.27457861,
         -0.71442887,  1.12302895],
        [ 1.50364073,  0.68100522,  1.00257707, ..., -0.64911323,
         -0.71442887,  1.12302895],
        ...,
        [ 0.9521966 , -1.46841752, -0.93851463, ..., -0.64911323,
          2.22410436,  1.12302895],
        [-1.47415758, -1.46841752,  1.00257707, ...,  0.97635214,
         -0.71442887, -0.51292188],
        [-0.04040284,  0.68100522,  1.00257707, ..., -2.27457861,
          0.26508221, -0.51292188]]), 147    1
 195    0
 203    0
 191    0
 103    1
       ..
 28     1
 257    0
 220    0
 122    1
 33     1
 Name: target, Length: 242, dtype: int64)
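As an aside, the scaler above is fit on the complete dataset before splitting. A leakage-free variant, sketched here on synthetic data since `heart.csv` is not bundled, fits `StandardScaler` on the training split only and reuses the same fitted scaler for the test split:

```python
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Synthetic stand-in for filtered_data's feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 4))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Fit on the training split only; the test split is transformed with
# statistics estimated from training data alone.
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

With this ordering the training features end up with mean 0 and unit variance, while the test features are merely close to that, which is the expected behaviour.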
In [121]:
def fitness(y_true, y_pred):
    """Composite score: accuracy + precision + recall (range 0..3)."""
    yt = y_true.to_numpy()

    accuracy = float(sum(y_pred == yt)) / len(yt)
    truePositive = sum(1 for i in range(len(y_pred)) if y_pred[i] == 1 and yt[i] == 1)
    trueNegative = sum(1 for i in range(len(y_pred)) if y_pred[i] == 0 and yt[i] == 0)
    falsePositive = sum(1 for i in range(len(y_pred)) if y_pred[i] == 1 and yt[i] == 0)
    falseNegative = sum(1 for i in range(len(y_pred)) if y_pred[i] == 0 and yt[i] == 1)

    # The 1e-4 term guards against division by zero when a class is never predicted.
    precision = truePositive / (truePositive + falsePositive + 1e-4)
    recall = truePositive / (truePositive + falseNegative + 1e-4)

    # Side effect: record every evaluation in module-level lists for plotting.
    accuracies.append(accuracy)
    precisions.append(precision)
    recalls.append(recall)

    return accuracy + precision + recall

scorer = make_scorer(fitness, greater_is_better=True)
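For reference, the composite score can be checked in isolation. This is a self-contained restatement of the same arithmetic on a toy prediction (a pandas Series for `y_true` and a numpy array for `y_pred`, mirroring how the scorer is called); the global bookkeeping lists are omitted here:

```python
import numpy as np
import pandas as pd

def fitness_demo(y_true, y_pred):
    # Same arithmetic as fitness() above, minus the global list appends.
    yt = y_true.to_numpy()
    accuracy = float(np.sum(y_pred == yt)) / len(yt)
    tp = int(np.sum((y_pred == 1) & (yt == 1)))
    fp = int(np.sum((y_pred == 1) & (yt == 0)))
    fn = int(np.sum((y_pred == 0) & (yt == 1)))
    precision = tp / (tp + fp + 1e-4)
    recall = tp / (tp + fn + 1e-4)
    return accuracy + precision + recall

y_true = pd.Series([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1])
score = fitness_demo(y_true, y_pred)  # accuracy 0.6, precision and recall each ~2/3
```

A perfect classifier scores just under 3.0 (the 1e-4 smoothing keeps precision and recall slightly below 1), which matches the magnitudes of the CV scores reported by TPOT later in the notebook.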

Logistic Regression

In [122]:
accuracies = []
precisions = []
recalls = []

parameters = {'tol': list(np.random.uniform(1e-6, 100.0, 10)),
              'C': list(np.random.uniform(1e-6, 100.0, 10)),
              'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
               
tpot_classifier = TPOTClassifier(template='Classifier',
                                 generations=50, population_size=50, offspring_size=12,
                                 verbosity=0, early_stop=12,
                                 config_dict={'sklearn.linear_model.LogisticRegression': parameters}, 
                                 cv=5, scoring=scorer)
tpot_classifier.fit(train_X, train_y)

fig = go.Figure()
fig.add_trace(go.Scatter(x=np.array(range(len(accuracies))), y=np.array(accuracies),
                    mode='lines',
                    name='accuracy'))
fig.add_trace(go.Scatter(x=np.array(range(len(precisions))), y=np.array(precisions),
                    mode='lines',
                    name='precision'))
fig.add_trace(go.Scatter(x=np.array(range(len(recalls))), y=np.array(recalls),
                    mode='lines', name='recall'))

fig.show(renderer="notebook")
In [123]:
model_logistic = tpot_classifier.fitted_pipeline_.steps[0][1]
print(model_logistic)
print('Model with optimized hyperparameters: ', fitness(test_y, model_logistic.predict(test_X)))
print('Test Accuracy: ', model_logistic.score(test_X, test_y))
print('Train Accuracy: ', model_logistic.score(train_X, train_y))

clf = LogisticRegression(random_state=0).fit(train_X, train_y)

print('Model without optimized hyperparameters: ', fitness(test_y, clf.predict(test_X)))
print('Test Accuracy: ', clf.score(test_X, test_y))
print('Train Accuracy: ', clf.score(train_X, train_y))

print(classification_report(test_y, model_logistic.predict(test_X)))

fig, ax = plt.subplots(figsize=(7, 7))

cm_display = ConfusionMatrixDisplay.from_estimator(  # avoid shadowing sklearn.metrics.confusion_matrix
    model_logistic,
    test_X,
    test_y,
    display_labels=['No heart disease', 'Heart disease'],
    cmap=plt.cm.Blues,
    ax=ax
)

plt.show()
LogisticRegression(C=7.712509409575325, solver='newton-cg',
                   tol=9.626032267265142)
Model with optimized hyperparameters:  2.7516347192738575
Test Accuracy:  0.9016393442622951
Train Accuracy:  0.8429752066115702
Model without optimized hyperparameters:  2.708318207864056
Test Accuracy:  0.8852459016393442
Train Accuracy:  0.8553719008264463
              precision    recall  f1-score   support

           0       0.86      0.86      0.86        21
           1       0.93      0.93      0.93        40

    accuracy                           0.90        61
   macro avg       0.89      0.89      0.89        61
weighted avg       0.90      0.90      0.90        61

/usr/local/lib/python3.7/dist-packages/daal4py/sklearn/linear_model/logistic_path.py:463: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
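The numbers in a classification report can be reproduced by hand from confusion-matrix counts. The counts below are illustrative stand-ins chosen to match the class supports of 21 and 40 shown above, not the exact matrix from this run:

```python
# Hypothetical binary confusion-matrix counts (illustrative only):
tn, fp = 18, 3   # true class 0: 21 samples
fn, tp = 3, 37   # true class 1: 40 samples

precision_1 = tp / (tp + fp)                         # of predicted 1s, how many were right
recall_1 = tp / (tp + fn)                            # of true 1s, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + tn + fp + fn)
```

With these counts, precision and recall for class 1 are both 37/40 = 0.925 and accuracy is 55/61 ≈ 0.90, in line with the report above.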

In [124]:
# Refit the tuned model on a 2-D t-SNE embedding purely for visualization;
# this is not the model that was evaluated above.
X_embedded = TSNE(n_components=2, learning_rate='auto', init='random').fit_transform(train_X)
model_logistic.fit(X_embedded, train_y)

fig, ax = plt.subplots()
plot_decision_regions(X_embedded, train_y.to_numpy(), clf=model_logistic, legend=2, ax=ax)

fig.suptitle('Logistic Regression on heart disease')
plt.show()
/usr/local/lib/python3.7/dist-packages/mlxtend/plotting/decision_regions.py:244: MatplotlibDeprecationWarning:

Passing unsupported keyword arguments to axis() will raise a TypeError in 3.3.

Decision Trees

In [125]:
accuracies = []
precisions = []
recalls = []

parameters = {'criterion': ['gini', 'entropy'],
              'splitter': ['best', 'random'],
              'min_impurity_decrease': list(np.linspace(0, 0.1, 20)),
              'ccp_alpha': list(np.linspace(0, 0.1, 20))}
               
tpot_classifier = TPOTClassifier(template='Classifier',
                                 generations=50, population_size=50, offspring_size=30,
                                 verbosity=0, early_stop=50,
                                 config_dict={'sklearn.tree.DecisionTreeClassifier': parameters}, 
                                 cv=5, scoring=scorer)
tpot_classifier.fit(train_X, train_y)

fig = go.Figure()
fig.add_trace(go.Scatter(x=np.array(range(len(accuracies))), y=np.array(accuracies),
                    mode='lines',
                    name='accuracy'))
fig.add_trace(go.Scatter(x=np.array(range(len(precisions))), y=np.array(precisions),
                    mode='lines',
                    name='precision'))
fig.add_trace(go.Scatter(x=np.array(range(len(recalls))), y=np.array(recalls),
                    mode='lines', name='recall'))

fig.show(renderer="notebook")
In [126]:
model_dt = tpot_classifier.fitted_pipeline_.steps[0][1]
print(model_dt)
print('Model with optimized hyperparameters: ', fitness(test_y, model_dt.predict(test_X)))
print('Test Accuracy: ', model_dt.score(test_X, test_y))
print('Train Accuracy: ', model_dt.score(train_X, train_y))

clf = tree.DecisionTreeClassifier(random_state=0).fit(train_X, train_y)

print('Model without optimized hyperparameters: ', fitness(test_y, clf.predict(test_X)))
print('Test Accuracy: ', clf.score(test_X, test_y))
print('Train Accuracy: ', clf.score(train_X, train_y))

print(classification_report(test_y, model_dt.predict(test_X)))

fig, ax = plt.subplots(figsize=(7, 7))

cm_display = ConfusionMatrixDisplay.from_estimator(
    model_dt,
    test_X,
    test_y,
    display_labels=['No heart disease', 'Heart disease'],
    cmap=plt.cm.Blues,
    ax=ax
)

plt.show()
DecisionTreeClassifier(ccp_alpha=0.010526315789473684,
                       min_impurity_decrease=0.005263157894736842)
Model with optimized hyperparameters:  2.447595063515754
Test Accuracy:  0.7868852459016393
Train Accuracy:  0.871900826446281
Model without optimized hyperparameters:  2.3669319965779505
Test Accuracy:  0.7540983606557377
Train Accuracy:  1.0
              precision    recall  f1-score   support

           0       0.65      0.81      0.72        21
           1       0.89      0.78      0.83        40

    accuracy                           0.79        61
   macro avg       0.77      0.79      0.78        61
weighted avg       0.81      0.79      0.79        61
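The `ccp_alpha` grid above is a fixed linspace; scikit-learn can instead enumerate the effective alphas at which pruning actually changes the tree, via `cost_complexity_pruning_path`. A minimal sketch on synthetic data (the notebook itself would pass `train_X`, `train_y`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart-disease features.
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

# cost_complexity_pruning_path returns the increasing sequence of alphas at
# which subtrees of the full tree get pruned away; these make a natural
# candidate grid for ccp_alpha instead of an arbitrary linspace.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas
```

The returned alphas are non-negative and non-decreasing, with the largest value corresponding to the tree pruned down to its root.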

In [127]:
plt.figure(figsize=(20,20))
plot_tree(model_dt, filled=True, feature_names=model_dt.feature_names_in_, class_names=['No disease', 'Heart disease'])
plt.title("Decision tree")
plt.show()
In [128]:
X_embedded = TSNE(n_components=2, learning_rate='auto', init='random').fit_transform(train_X)
model_dt.fit(X_embedded, train_y)

fig, ax = plt.subplots()
plot_decision_regions(X_embedded, train_y.to_numpy(), clf=model_dt, legend=2, ax=ax)

fig.suptitle('Decision Tree on heart disease')
plt.show()

Random Forest

In [129]:
accuracies = []
precisions = []
recalls = []

parameters = {'criterion': ['gini', 'entropy'],
              'max_depth': [2, 3],
              'min_impurity_decrease': list(np.linspace(0, 0.1, 20)),
              'n_estimators': list(np.linspace(50, 150, 20, dtype=int)),
              'ccp_alpha': list(np.linspace(0, 0.1, 20)),
              'max_features': ['auto', 'sqrt', 'log2']}
               
tpot_classifier = TPOTClassifier(template='Classifier',
                                 generations=20, population_size=20, offspring_size=10,
                                 verbosity=2, early_stop=20,
                                 config_dict={'sklearn.ensemble.RandomForestClassifier': parameters}, 
                                 cv=5, scoring=scorer)
tpot_classifier.fit(train_X, train_y)

fig = go.Figure()
fig.add_trace(go.Scatter(x=np.array(range(len(accuracies))), y=np.array(accuracies),
                    mode='lines',
                    name='accuracy'))
fig.add_trace(go.Scatter(x=np.array(range(len(precisions))), y=np.array(precisions),
                    mode='lines',
                    name='precision'))
fig.add_trace(go.Scatter(x=np.array(range(len(recalls))), y=np.array(recalls),
                    mode='lines', name='recall'))

fig.show(renderer="notebook")
Generation 1 - Current best internal CV score: 2.5043515952134725

Generation 2 - Current best internal CV score: 2.5130679105526137

Generation 3 - Current best internal CV score: 2.5130679105526137

Generation 4 - Current best internal CV score: 2.5130679105526137

Generation 5 - Current best internal CV score: 2.5130679105526137

Generation 6 - Current best internal CV score: 2.5130679105526137

Generation 7 - Current best internal CV score: 2.5130679105526137

Generation 8 - Current best internal CV score: 2.53879367258495

Generation 9 - Current best internal CV score: 2.53879367258495

Generation 10 - Current best internal CV score: 2.53879367258495

Generation 11 - Current best internal CV score: 2.53879367258495

Generation 12 - Current best internal CV score: 2.53879367258495

Generation 13 - Current best internal CV score: 2.53879367258495

Generation 14 - Current best internal CV score: 2.53879367258495

Generation 15 - Current best internal CV score: 2.53879367258495

Generation 16 - Current best internal CV score: 2.53879367258495

Generation 17 - Current best internal CV score: 2.53879367258495

Generation 18 - Current best internal CV score: 2.53879367258495

Generation 19 - Current best internal CV score: 2.53879367258495

Generation 20 - Current best internal CV score: 2.53879367258495

Best pipeline: RandomForestClassifier(input_matrix, ccp_alpha=0.0, criterion=gini, max_depth=2, max_features=log2, min_impurity_decrease=0.015789473684210527, n_estimators=107)
In [130]:
model_rf = tpot_classifier.fitted_pipeline_.steps[0][1]
print('Model with optimized hyperparameters: ', fitness(test_y, model_rf.predict(test_X)))
print('Test Accuracy: ', model_rf.score(test_X, test_y))
print('Train Accuracy: ', model_rf.score(train_X, train_y))

clf = sklearn.ensemble.RandomForestClassifier(random_state=0).fit(train_X, train_y)

print('Model without optimized hyperparameters: ', fitness(test_y, clf.predict(test_X)))
print('Test Accuracy: ', clf.score(test_X, test_y))
print('Train Accuracy: ', clf.score(train_X, train_y))

print(classification_report(test_y, model_rf.predict(test_X)))

fig, ax = plt.subplots(figsize=(7, 7))

cm_display = ConfusionMatrixDisplay.from_estimator(
    model_rf,
    test_X,
    test_y,
    display_labels=['No heart disease', 'Heart disease'],
    cmap=plt.cm.Blues,
    ax=ax
)

plt.show()
Model with optimized hyperparameters:  2.7516347192738575
Test Accuracy:  0.9016393442622951
Train Accuracy:  0.8512396694214877
Model without optimized hyperparameters:  2.624890425223023
Test Accuracy:  0.8524590163934426
Train Accuracy:  1.0
              precision    recall  f1-score   support

           0       0.86      0.86      0.86        21
           1       0.93      0.93      0.93        40

    accuracy                           0.90        61
   macro avg       0.89      0.89      0.89        61
weighted avg       0.90      0.90      0.90        61
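A fitted random forest also exposes impurity-based feature importances, which would indicate which of the nine heart-disease features drive its predictions. A minimal sketch on synthetic data (in the notebook, `model_rf` and `train_X.columns` would be substituted):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for train_X / train_y.
X, y = make_classification(n_samples=200, n_features=9, n_informative=4,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds one non-negative value per column; the values
# sum to 1, so each can be read as a relative share of the total importance.
importances = rf.feature_importances_
```

Pairing these values with the column names (`age`, `sex`, `cp`, ...) gives a quick ranking of which inputs the forest relies on.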

In [131]:
X_embedded = TSNE(n_components=2, learning_rate='auto', init='random').fit_transform(train_X)
model_rf.fit(X_embedded, train_y)

fig, ax = plt.subplots()
plot_decision_regions(X_embedded, train_y.to_numpy(), clf=model_rf, legend=2, ax=ax)

fig.suptitle('Random Forest on heart disease')
plt.show()

XGBoost

In [132]:
accuracies = []
precisions = []
recalls = []

parameters = {'learning_rate': list(np.linspace(0.1, 2, 20)),
              'use_label_encoder': [False],
              'validate_parameters': [False],
              'disable_default_eval_metric': [True],
              'subsample': list(np.linspace(0, 1, 20)),
              'colsample_bynode': list(np.linspace(0, 1, 20)),
              'reg_lambda': list(np.linspace(0, 0.1, 20))}
               
tpot_classifier = TPOTClassifier(template='Classifier',
                                 generations=20, population_size=20, offspring_size=10,
                                 verbosity=2, early_stop=5,
                                 config_dict={'xgboost.XGBClassifier': parameters}, 
                                 cv=5, scoring=scorer)
tpot_classifier.fit(train_X, train_y)

fig = go.Figure()
fig.add_trace(go.Scatter(x=np.array(range(len(accuracies))), y=np.array(accuracies),
                    mode='lines',
                    name='accuracy'))
fig.add_trace(go.Scatter(x=np.array(range(len(precisions))), y=np.array(precisions),
                    mode='lines',
                    name='precision'))
fig.add_trace(go.Scatter(x=np.array(range(len(recalls))), y=np.array(recalls),
                    mode='lines', name='recall'))

fig.show(renderer="notebook")
Generation 1 - Current best internal CV score: 2.4498096168797834

Generation 2 - Current best internal CV score: 2.45551310088904

Generation 3 - Current best internal CV score: 2.4690608856247196

Generation 4 - Current best internal CV score: 2.4690608856247196

Generation 5 - Current best internal CV score: 2.4838680530841084

Generation 6 - Current best internal CV score: 2.4838680530841084

Generation 7 - Current best internal CV score: 2.4875187768322378

Generation 8 - Current best internal CV score: 2.4885155969281163

Generation 9 - Current best internal CV score: 2.4885155969281163

Generation 10 - Current best internal CV score: 2.4885155969281163

Generation 11 - Current best internal CV score: 2.499255445781983

Generation 12 - Current best internal CV score: 2.499255445781983

Generation 13 - Current best internal CV score: 2.523994957620647

Generation 14 - Current best internal CV score: 2.523994957620647

Generation 15 - Current best internal CV score: 2.523994957620647

Generation 16 - Current best internal CV score: 2.523994957620647

Generation 17 - Current best internal CV score: 2.529224645240536

Generation 18 - Current best internal CV score: 2.529224645240536

Generation 19 - Current best internal CV score: 2.529224645240536

Generation 20 - Current best internal CV score: 2.529224645240536

Best pipeline: XGBClassifier(input_matrix, colsample_bynode=0.7894736842105263, disable_default_eval_metric=True, learning_rate=0.3, reg_lambda=0.042105263157894736, subsample=0.15789473684210525, use_label_encoder=False, validate_parameters=False)
In [133]:
model_xg = tpot_classifier.fitted_pipeline_.steps[0][1]
print(model_xg)
print('Model with optimized hyperparameters: ', fitness(test_y, model_xg.predict(test_X)))
print('Test Accuracy: ', model_xg.score(test_X, test_y))
print('Train Accuracy: ', model_xg.score(train_X, train_y))

clf = xgboost.XGBClassifier().fit(train_X, train_y)

print('Model without optimized hyperparameters: ', fitness(test_y, clf.predict(test_X)))
print('Test Accuracy: ', clf.score(test_X, test_y))
print('Train Accuracy: ', clf.score(train_X, train_y))

print(classification_report(test_y, model_xg.predict(test_X)))

fig, ax = plt.subplots(figsize=(7, 7))

cm_display = ConfusionMatrixDisplay.from_estimator(
    model_xg,
    test_X,
    test_y,
    display_labels=['No heart disease', 'Heart disease'],
    cmap=plt.cm.Blues,
    ax=ax
)

plt.show()
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=0.7894736842105263, colsample_bytree=1,
              disable_default_eval_metric=True, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.3, max_delta_step=0,
              max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=2,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=0.042105263157894736, scale_pos_weight=1,
              subsample=0.15789473684210525, tree_method='exact',
              use_label_encoder=False, validate_parameters=False,
              verbosity=None)
Model with optimized hyperparameters:  2.5860611987814295
Test Accuracy:  0.8360655737704918
Train Accuracy:  0.8801652892561983
[18:37:10] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
/usr/local/lib/python3.7/dist-packages/xgboost/sklearn.py:1224: UserWarning:

The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].

Model without optimized hyperparameters:  2.5339532332008208
Test Accuracy:  0.819672131147541
Train Accuracy:  1.0
              precision    recall  f1-score   support

           0       0.76      0.76      0.76        21
           1       0.88      0.88      0.88        40

    accuracy                           0.84        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.84      0.84      0.84        61

In [134]:
X_embedded = TSNE(n_components=2, learning_rate='auto', init='random').fit_transform(train_X)
model_xg.fit(X_embedded, train_y)

fig, ax = plt.subplots()
plot_decision_regions(X_embedded, train_y.to_numpy(), clf=model_xg, legend=2, ax=ax)

fig.suptitle('XGBoost on heart disease')
plt.show()

Naive Bayes

In [135]:
accuracies = []
precisions = []
recalls = []

parameters = {'alpha': list(np.linspace(0.1, 2, 20)),
              'norm': [False, True],}
               
tpot_classifier = TPOTClassifier(template='Classifier',
                                 generations=20, population_size=100, offspring_size=20,
                                 verbosity=2, early_stop=100,
                                 config_dict={'sklearn.naive_bayes.ComplementNB': parameters}, 
                                 cv=5, scoring=scorer)
tpot_classifier.fit(train_X, train_y)

fig = go.Figure()
fig.add_trace(go.Scatter(x=np.array(range(len(accuracies))), y=np.array(accuracies),
                    mode='lines',
                    name='accuracy'))
fig.add_trace(go.Scatter(x=np.array(range(len(precisions))), y=np.array(precisions),
                    mode='lines',
                    name='precision'))
fig.add_trace(go.Scatter(x=np.array(range(len(recalls))), y=np.array(recalls),
                    mode='lines', name='recall'))

fig.show(renderer="notebook")
Generation 1 - Current best internal CV score: 2.3540054361437157

Generation 2 - Current best internal CV score: 2.3540054361437157

Generation 3 - Current best internal CV score: 2.3540054361437157

Generation 4 - Current best internal CV score: 2.3540054361437157

Generation 5 - Current best internal CV score: 2.3540054361437157

Generation 6 - Current best internal CV score: 2.3540054361437157

Generation 7 - Current best internal CV score: 2.3540054361437157

Generation 8 - Current best internal CV score: 2.3540054361437157

Generation 9 - Current best internal CV score: 2.3540054361437157

Generation 10 - Current best internal CV score: 2.3540054361437157

Generation 11 - Current best internal CV score: 2.3540054361437157

Generation 12 - Current best internal CV score: 2.3540054361437157

Generation 13 - Current best internal CV score: 2.3540054361437157

Generation 14 - Current best internal CV score: 2.3540054361437157

Generation 15 - Current best internal CV score: 2.3540054361437157

Generation 16 - Current best internal CV score: 2.3540054361437157

Generation 17 - Current best internal CV score: 2.3540054361437157

Generation 18 - Current best internal CV score: 2.3540054361437157

Generation 19 - Current best internal CV score: 2.3540054361437157

Generation 20 - Current best internal CV score: 2.3540054361437157

Best pipeline: ComplementNB(input_matrix, alpha=1.8, norm=False)
In [136]:
model_nb = tpot_classifier.fitted_pipeline_.steps[0][1]
print(model_nb)
print('Model with optimized hyperparameters: ', fitness(test_y, model_nb.predict(test_X)))
print('Test Accuracy: ', model_nb.score(test_X, test_y))
print('Train Accuracy: ', model_nb.score(train_X, train_y))

clf = sklearn.naive_bayes.ComplementNB().fit(train_X, train_y)

print('Model without optimized hyperparameters: ', fitness(test_y, clf.predict(test_X)))
print('Test Accuracy: ', clf.score(test_X, test_y))
print('Train Accuracy: ', clf.score(train_X, train_y))

print(classification_report(test_y, model_nb.predict(test_X)))

fig, ax = plt.subplots(figsize=(7, 7))

cm_display = ConfusionMatrixDisplay.from_estimator(
    model_nb,
    test_X,
    test_y,
    display_labels=['No heart disease', 'Heart disease'],
    cmap=plt.cm.Blues,
    ax=ax
)

plt.show()
ComplementNB(alpha=1.8)
Model with optimized hyperparameters:  2.6688479590276435
Test Accuracy:  0.8688524590163934
Train Accuracy:  0.7768595041322314
Model without optimized hyperparameters:  2.6688479590276435
Test Accuracy:  0.8688524590163934
Train Accuracy:  0.7768595041322314
              precision    recall  f1-score   support

           0       0.81      0.81      0.81        21
           1       0.90      0.90      0.90        40

    accuracy                           0.87        61
   macro avg       0.85      0.85      0.85        61
weighted avg       0.87      0.87      0.87        61

In [137]:
X_embedded = TSNE(n_components=2, learning_rate='auto', init='random').fit_transform(train_X)
# ComplementNB requires non-negative features, so shift the embedding before
# fitting -- and plot the same shifted data the model was trained on.
X_shifted = X_embedded - X_embedded.min()
model_nb.fit(X_shifted, train_y)

fig, ax = plt.subplots()
plot_decision_regions(X_shifted, train_y.to_numpy(), clf=model_nb, legend=2, ax=ax)

fig.suptitle('Naive Bayes on heart disease')
plt.show()

k-NN

In [138]:
accuracies = []
precisions = []
recalls = []

parameters = {'n_neighbors': [2, 3, 4, 5, 6, 7, 8, 9],
              'weights': ['uniform', 'distance'],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              }
               
tpot_classifier = TPOTClassifier(template='Classifier',
                                 generations=20, population_size=100, offspring_size=20,
                                 verbosity=2, early_stop=100,
                                 config_dict={'sklearn.neighbors.KNeighborsClassifier': parameters}, 
                                 cv=5, scoring=scorer)
tpot_classifier.fit(train_X, train_y)

fig = go.Figure()
fig.add_trace(go.Scatter(x=np.array(range(len(accuracies))), y=np.array(accuracies),
                    mode='lines',
                    name='accuracy'))
fig.add_trace(go.Scatter(x=np.array(range(len(precisions))), y=np.array(precisions),
                    mode='lines',
                    name='precision'))
fig.add_trace(go.Scatter(x=np.array(range(len(recalls))), y=np.array(recalls),
                    mode='lines', name='recall'))

fig.show(renderer="notebook")
Generation 1 - Current best internal CV score: 2.135904694521783

Generation 2 - Current best internal CV score: 2.135904694521783

Generation 3 - Current best internal CV score: 2.135904694521783

Generation 4 - Current best internal CV score: 2.135904694521783

Generation 5 - Current best internal CV score: 2.135904694521783

Generation 6 - Current best internal CV score: 2.135904694521783

Generation 7 - Current best internal CV score: 2.135904694521783

Generation 8 - Current best internal CV score: 2.135904694521783

Generation 9 - Current best internal CV score: 2.135904694521783

Generation 10 - Current best internal CV score: 2.135904694521783

Generation 11 - Current best internal CV score: 2.135904694521783

Generation 12 - Current best internal CV score: 2.135904694521783

Generation 13 - Current best internal CV score: 2.135904694521783

Generation 14 - Current best internal CV score: 2.135904694521783

Generation 15 - Current best internal CV score: 2.135904694521783

Generation 16 - Current best internal CV score: 2.135904694521783

Generation 17 - Current best internal CV score: 2.135904694521783

Generation 18 - Current best internal CV score: 2.135904694521783

Generation 19 - Current best internal CV score: 2.135904694521783

Generation 20 - Current best internal CV score: 2.135904694521783

Best pipeline: KNeighborsClassifier(input_matrix, algorithm=auto, n_neighbors=5, weights=uniform)
In [139]:
model_knn = tpot_classifier.fitted_pipeline_.steps[0][1]
print(model_knn)
print('Model with optimized hyperparameters: ', fitness(test_y, model_knn.predict(test_X)))
print('Test Accuracy: ', model_knn.score(test_X, test_y))
print('Train Accuracy: ', model_knn.score(train_X, train_y))

clf = sklearn.neighbors.KNeighborsClassifier().fit(train_X, train_y)

print('Model without optimized hyperparameters: ', fitness(test_y, clf.predict(test_X)))
print('Test Accuracy: ', clf.score(test_X, test_y))
print('Train Accuracy: ', clf.score(train_X, train_y))

print(classification_report(test_y, model_knn.predict(test_X)))

fig, ax = plt.subplots(figsize=(7, 7))

confusion_matrix = ConfusionMatrixDisplay.from_estimator(
    model_knn,
    test_X,
    test_y,
    display_labels=['No heart disease', 'Heart disease'],
    cmap=plt.cm.Blues,
    ax=ax
)

plt.show()
KNeighborsClassifier()
Model with optimized hyperparameters:  2.312700433670707
Test Accuracy:  0.7377049180327869
Train Accuracy:  0.756198347107438
Model without optimized hyperparameters:  2.312700433670707
Test Accuracy:  0.7377049180327869
Train Accuracy:  0.756198347107438
              precision    recall  f1-score   support

           0       0.59      0.81      0.68        21
           1       0.88      0.70      0.78        40

    accuracy                           0.74        61
   macro avg       0.73      0.75      0.73        61
weighted avg       0.78      0.74      0.74        61

In [140]:
X_embedded = TSNE(n_components=2, learning_rate='auto', init='random').fit_transform(train_X)
X_embedded -= X_embedded.min()  # shift once so the fitted model and the plot share the same coordinates
model_knn.fit(X_embedded, train_y)

fig, ax = plt.subplots()
plot_decision_regions(X_embedded, train_y.to_numpy(), clf=model_knn, legend=2, ax=ax)

fig.suptitle('k-NN on heart disease')
plt.show()

Neural Networks

In [141]:
accuracies = []
precisions = []
recalls = []

parameters = {'hidden_layer_sizes': [(100,), (64,), (32,), (128,), (256,)],
              'activation': ['identity', 'logistic', 'tanh', 'relu'],
              'solver': ['lbfgs', 'sgd', 'adam'],
              'learning_rate': ['constant', 'invscaling', 'adaptive'],
              'alpha': list(np.linspace(0, 0.1, 30)),
              'learning_rate_init': list(np.linspace(0, 0.1, 30))[1:],  # drop 0: must be strictly positive
              'power_t': list(np.linspace(0, 1, 30)),
              'momentum': list(np.linspace(0.5, 1, 30)),
              'nesterovs_momentum': [True, False]
              }
               
tpot_classifier = TPOTClassifier(template='Classifier',
                                 generations=10, population_size=50, offspring_size=10,
                                 verbosity=2, early_stop=2,
                                 config_dict={'sklearn.neural_network.MLPClassifier': parameters}, 
                                 cv=5, scoring=scorer)
tpot_classifier.fit(normalized_train_X, normalized_train_y)

fig = go.Figure()
fig.add_trace(go.Scatter(x=np.array(range(len(accuracies))), y=np.array(accuracies),
                    mode='lines',
                    name='accuracy'))
fig.add_trace(go.Scatter(x=np.array(range(len(precisions))), y=np.array(precisions),
                    mode='lines',
                    name='precision'))
fig.add_trace(go.Scatter(x=np.array(range(len(recalls))), y=np.array(recalls),
                    mode='lines', name='recall'))

fig.show(renderer="notebook")
Generation 1 - Current best internal CV score: 2.622746034633967

Generation 2 - Current best internal CV score: 2.622746034633967

Generation 3 - Current best internal CV score: 2.622746034633967

The optimized pipeline was not improved after evaluating 2 more generations. Will end the optimization process.

TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: MLPClassifier(input_matrix, activation=relu, alpha=0.08620689655172414, hidden_layer_sizes=100, learning_rate=constant, learning_rate_init=0.055172413793103454, momentum=0.8620689655172413, nesterovs_momentum=False, power_t=0.20689655172413793, solver=sgd)
In [142]:
model_nn = tpot_classifier.fitted_pipeline_.steps[0][1]
model_nn.fit(normalized_train_X, normalized_train_y)
print(model_nn)
print('Model with optimized hyperparameters: ', fitness(normalized_test_y, model_nn.predict(normalized_test_X)))
print('Test Accuracy: ', model_nn.score(normalized_test_X, normalized_test_y))
print('Train Accuracy: ', model_nn.score(normalized_train_X, normalized_train_y))

clf = sklearn.neural_network.MLPClassifier().fit(normalized_train_X, normalized_train_y)

print('Model without optimized hyperparameters: ', fitness(normalized_test_y, clf.predict(normalized_test_X)))
print('Test Accuracy: ', clf.score(normalized_test_X, normalized_test_y))
print('Train Accuracy: ', clf.score(normalized_train_X, normalized_train_y))

print(classification_report(normalized_test_y, model_nn.predict(normalized_test_X)))

fig, ax = plt.subplots(figsize=(7, 7))

confusion_matrix = ConfusionMatrixDisplay.from_estimator(
    model_nn,
    normalized_test_X,
    normalized_test_y,
    display_labels=['No heart disease', 'Heart disease'],
    cmap=plt.cm.Blues,
    ax=ax
)

plt.show()
/usr/local/lib/python3.7/dist-packages/sklearn/neural_network/_multilayer_perceptron.py:696: ConvergenceWarning:

Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.

MLPClassifier(alpha=0.08620689655172414, hidden_layer_sizes=100,
              learning_rate_init=0.055172413793103454,
              momentum=0.8620689655172413, nesterovs_momentum=False,
              power_t=0.20689655172413793, solver='sgd')
Model with optimized hyperparameters:  2.3350279317920206
Test Accuracy:  0.7868852459016393
Train Accuracy:  0.9586776859504132
/usr/local/lib/python3.7/dist-packages/sklearn/neural_network/_multilayer_perceptron.py:696: ConvergenceWarning:

Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.


Model without optimized hyperparameters:  2.2318709244601926
Test Accuracy:  0.7540983606557377
Train Accuracy:  0.9173553719008265
              precision    recall  f1-score   support

           0       0.76      0.84      0.80        31
           1       0.81      0.73      0.77        30

    accuracy                           0.79        61
   macro avg       0.79      0.79      0.79        61
weighted avg       0.79      0.79      0.79        61

In [143]:
X_embedded = TSNE(n_components=2, learning_rate='auto', init='random').fit_transform(train_X)
X_embedded -= X_embedded.min()  # shift once so the fitted model and the plot share the same coordinates
model_nn.fit(X_embedded, train_y)

fig, ax = plt.subplots()
plot_decision_regions(X_embedded, train_y.to_numpy(), clf=model_nn, legend=2, ax=ax)

fig.suptitle('NN on heart disease')
plt.show()

SVM

In [144]:
accuracies = []
precisions = []
recalls = []

parameters = {
              'C': list(np.linspace(0, 2, 30))[1:],  # drop 0: C must be strictly positive
              'coef0': list(np.linspace(0, 1, 30)),
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],  # 'precomputed' needs a square kernel matrix, so it is excluded
              'degree': [2, 3, 4],
              'gamma': ['scale', 'auto'],
              }
               
tpot_classifier = TPOTClassifier(template='Classifier',
                                 generations=10, population_size=50, offspring_size=50,
                                 verbosity=2, early_stop=10,
                                 config_dict={'sklearn.svm.SVC': parameters}, 
                                 cv=5, scoring=scorer)
tpot_classifier.fit(normalized_train_X, normalized_train_y)

fig = go.Figure()
fig.add_trace(go.Scatter(x=np.array(range(len(accuracies))), y=np.array(accuracies),
                    mode='lines',
                    name='accuracy'))
fig.add_trace(go.Scatter(x=np.array(range(len(precisions))), y=np.array(precisions),
                    mode='lines',
                    name='precision'))
fig.add_trace(go.Scatter(x=np.array(range(len(recalls))), y=np.array(recalls),
                    mode='lines', name='recall'))

fig.show(renderer="notebook")
Generation 1 - Current best internal CV score: 2.6185074793209124

Generation 2 - Current best internal CV score: 2.6185074793209124

Generation 3 - Current best internal CV score: 2.6185074793209124

Generation 4 - Current best internal CV score: 2.6185074793209124

Generation 5 - Current best internal CV score: 2.6185074793209124

Generation 6 - Current best internal CV score: 2.6185074793209124

Generation 7 - Current best internal CV score: 2.6185074793209124

Generation 8 - Current best internal CV score: 2.6185074793209124

Generation 9 - Current best internal CV score: 2.6185074793209124

Generation 10 - Current best internal CV score: 2.6185074793209124

Best pipeline: SVC(input_matrix, C=0.13793103448275862, coef0=0.27586206896551724, degree=2, gamma=auto, kernel=linear)
In [145]:
model_svm = tpot_classifier.fitted_pipeline_.steps[0][1]
print(model_svm)
print('Model with optimized hyperparameters: ', fitness(normalized_test_y, model_svm.predict(normalized_test_X)))
print('Test Accuracy: ', model_svm.score(normalized_test_X, normalized_test_y))
print('Train Accuracy: ', model_svm.score(normalized_train_X, normalized_train_y))

clf = sklearn.svm.SVC().fit(normalized_train_X, normalized_train_y)

print('Model without optimized hyperparameters: ', fitness(normalized_test_y, clf.predict(normalized_test_X)))
print('Test Accuracy: ', clf.score(normalized_test_X, normalized_test_y))
print('Train Accuracy: ', clf.score(normalized_train_X, normalized_train_y))

print(classification_report(normalized_test_y, model_svm.predict(normalized_test_X)))

fig, ax = plt.subplots(figsize=(7, 7))

confusion_matrix = ConfusionMatrixDisplay.from_estimator(
    model_svm,
    normalized_test_X,
    normalized_test_y,
    display_labels=['No heart disease', 'Heart disease'],
    cmap=plt.cm.Blues,
    ax=ax
)

plt.show()
SVC(C=0.13793103448275862, coef0=0.27586206896551724, degree=2, gamma='auto',
    kernel='linear')
Model with optimized hyperparameters:  2.4032733552090346
Test Accuracy:  0.8032786885245902
Train Accuracy:  0.8553719008264463
Model without optimized hyperparameters:  2.4594516981671575
Test Accuracy:  0.819672131147541
Train Accuracy:  0.8966942148760331
              precision    recall  f1-score   support

           0       0.81      0.81      0.81        31
           1       0.80      0.80      0.80        30

    accuracy                           0.80        61
   macro avg       0.80      0.80      0.80        61
weighted avg       0.80      0.80      0.80        61


In [146]:
X_embedded = TSNE(n_components=2, learning_rate='auto', init='random').fit_transform(train_X)
X_embedded -= X_embedded.min()  # shift once so the fitted model and the plot share the same coordinates
model_svm.fit(X_embedded, train_y)

fig, ax = plt.subplots()
plot_decision_regions(X_embedded, train_y.to_numpy(), clf=model_svm, legend=2, ax=ax)

fig.suptitle('SVM on heart disease')
plt.show()

Stacking

In [147]:
accuracies = [model_logistic.fit(train_X, train_y).score(test_X, test_y), model_dt.fit(train_X, train_y).score(test_X, test_y),
              model_rf.fit(train_X, train_y).score(test_X, test_y), model_xg.fit(train_X, train_y).score(test_X, test_y),
              model_nb.fit(train_X, train_y).score(test_X, test_y), model_knn.fit(train_X, train_y).score(test_X, test_y),
              model_nn.fit(normalized_train_X, normalized_train_y).score(normalized_test_X, normalized_test_y),
              model_svm.fit(normalized_train_X, normalized_train_y).score(normalized_test_X, normalized_test_y)]

estimators = [('logistic', model_logistic), ('dt', model_dt),
              ('rf', model_rf), ('xg', model_xg),
              ('nb', model_nb), ('knn', model_knn),
              ('nn', model_nn), ('svm', model_svm)]

stackingClassifier = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())

stackingAccuracy = stackingClassifier.fit(train_X, train_y).score(test_X, test_y)
accuracies.append(stackingAccuracy)
stackingAccuracy
/usr/local/lib/python3.7/dist-packages/sklearn/neural_network/_multilayer_perceptron.py:696: ConvergenceWarning:

Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.

Out[147]:
0.8688524590163934
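The stacking step above can be illustrated with a minimal self-contained sketch: `StackingClassifier` fits each base model, then trains the final estimator on their cross-validated predictions. The synthetic data and the two lightweight base models here are stand-ins for the heart disease frame and the eight tuned models used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data.
X, y = make_classification(n_samples=300, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

stack = StackingClassifier(
    estimators=[('knn', KNeighborsClassifier()),
                ('dt', DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(),
    cv=5)  # out-of-fold predictions of the base models train the final estimator

print(stack.fit(Xtr, ytr).score(Xte, yte))
```

Because the final estimator sees only out-of-fold predictions, a base model that overfits its training folds does not automatically dominate the stack.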

Accuracy comparison

In [149]:
classifiers = ['Logistic Regression', 'Decision Tree', 'Random Forest',
               'XGBoost', 'Naive Bayes', 'K-nn', 'Neural Networks', 'SVM', 'Stacking Classifier']

fig = go.Figure()
fig.add_trace(go.Bar(x=classifiers, y=np.array(accuracies)))
fig.update_layout(title="Accuracies barplot")
fig.show(renderer="notebook")

Conclusions

The genetic algorithm manages to find well-adjusted hyperparameters with which the algorithms produce good results most of the time. However, since the dataset only contains about 300 entries, from which we cut an additional 20% for the test set, there is not much room to improve the results.

Since the GA performs cross-validation, the effective training set is even smaller while searching for hyperparameter values. That is why different runs yield different results, sometimes worse than those of an algorithm with untuned parameters. Additionally, the fitness function defined for the GA is the sum of accuracy, precision and recall, which means that a lower accuracy might be balanced by a higher recall, and so on.
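A minimal sketch of such a fitness function follows; the `fitness` and `scorer` defined earlier in the notebook are assumed to compute this same sum, where a perfect classifier scores 3.0.

```python
from sklearn.metrics import accuracy_score, make_scorer, precision_score, recall_score

def fitness_sketch(y_true, y_pred):
    # Sum of the three metrics; each lies in [0, 1], so the total lies in [0, 3].
    return (accuracy_score(y_true, y_pred)
            + precision_score(y_true, y_pred)
            + recall_score(y_true, y_pred))

# Wrapped as a scorer, this is the quantity TPOT maximizes during its CV search.
scorer_sketch = make_scorer(fitness_sketch)

print(fitness_sketch([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75 + 1.0 + 0.5 = 2.25
```

The example value shows the trade-off described above: perfect precision compensates for a mediocre recall in the total score.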

On a previous run we got the following results:

  • Logistic Regression performed better on the test set with the tuned hyperparameters (90% vs 88% accuracy), while the train accuracy decreased slightly (84% vs 85%). However, that is what we want from the model: to avoid overfitting and generalize better.
  • Decision trees had better results on the test set (78% vs 75% accuracy) when using the tuned hyperparameters, while the train set accuracy was lower (87% vs 100%). That is to be expected, since decision trees tend to overfit the data, which is why they reach 100% train accuracy.
  • Random forest performed similarly to decision trees, which is expected since it internally combines smaller decision trees. It obtained 90% test accuracy, similar to Logistic Regression.
  • XGBoost overfitted the training data when not using tuned hyperparameters, while the tuned version obtained higher test accuracy (83% vs 81%).
  • Naive Bayes obtained the same results in the standard form as with tuned hyperparameters (87% test and 77% train accuracy).
  • K-NN obtained the same results with the tuned hyperparameters as the one with the default ones.
  • The neural network achieved 78% accuracy on the test set, slightly better than the 75% obtained with the default parameters. The train accuracy is also higher (95% vs 91%).
  • SVM had worse results on both the train and the test sets with the tuned hyperparameters. Again, this is because of the cross-validation and the fitness function.

We conclude that Logistic Regression and Random Forest were the best-performing algorithms on this dataset for this run. However, as stated above, these results vary between runs, given the stochastic nature of the train/test split and of the hyperparameter values chosen by the GA.
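To make runs comparable, the stochastic pieces can be seeded. A sketch, using synthetic data as a hypothetical stand-in for the heart disease frame; `TPOTClassifier` likewise accepts a `random_state` argument to seed the genetic search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data for the heart disease dataset.
X, y = make_classification(n_samples=300, random_state=0)

# A fixed random_state makes the split (and hence the reported scores) repeatable;
# stratify keeps the class balance identical in train and test.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(len(train_X), len(test_X))  # 240 60
```

Seeding does not remove the variance discussed above, but it separates "the GA found a different optimum" from "the data was split differently".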